Abstract

Artificial intelligence (AI) is a wide-ranging branch of computer science that
deals with the construction of smart machines capable of performing tasks that
typically require human intelligence. Machine learning is a method of data
analysis that automates the building of analytical models. It is based on the
idea that a system can learn from data, identify patterns and make decisions
with minimal human intervention. Artificial intelligence and machine learning
are interconnected fields. Bioinformatics approaches have been used to address
numerous biological problems. Machine learning and AI are revolutionizing
computational biology and bioinformatics. There have been progressive
advancements in computer science and bioinformatics. The use of AI and
machine learning in bioinformatics is opening new ways to manage biological
data and to perform different analyses that lead to logical conclusions. This
article describes the use of some key tools of applied AI and machine
learning in bioinformatics.


This work is licensed under the Creative Commons Attribution-NonCommercial
4.0 International License.
Introduction


Bioinformatics is an interdisciplinary field that
promises effective analysis of huge volumes of
biological data using powerful computing and
mathematical techniques [1]. Advances in
bioinformatics have solved many complex biological
problems. Bioinformatics approaches have been used
to address numerous biological problems. Artificial
intelligence and machine learning are interconnected
fields, and both are widely used to create
intelligent systems. Artificial intelligence systems
do not need to be reprogrammed for each task;
instead, they use algorithms that can adapt on their
own. In machine learning, machines are trained on
data and make future predictions by learning the
patterns in the data provided as input. The use of
AI in structural bioinformatics tools is an
effective way to design novel compounds against
neurological disorders. Today, most biological work
is based on the prediction of biological systems
[2-5]. Elucidating biological functions through
bioinformatics approaches helps in treating diseases
and in understanding other biological processes [6].
Mutational analysis, protein structure and protein
function are considered areas that can be addressed
effectively through bioinformatics approaches.
However, the complexity and sheer amount of
biological data can cause management and analysis
issues [7]. As the data from whole-genome sequencing
projects grow in volume and complexity, the need for
more powerful computational techniques in
bioinformatics increases [8-10]. Artificial
intelligence and machine learning techniques are
transforming computational power, as they were
developed to solve problems intelligently, following
the concept of machines having human-like
intelligence. Simulation of different models,
biological sequence annotation, computational drug
design, virtual screening, binding site prediction
and gene prediction can be performed effectively
with AI and machine learning algorithms. The
probabilistic power of these techniques is of great
value in bioinformatics for solving complex
biological problems [11-17].


Decision Trees for Classification: A 
Machine Learning Algorithm 


In bioinformatics, researchers are designing efficient
algorithms to analyze gigantic amounts of data
[18,19]. Microarray data contain the expression
profiles of thousands of genes under different
conditions [20-24]. These expression profiles can be
used to detect and track ailments and the response of
patients to medication. This requires a technique
that can detect similar expression patterns and
cluster individuals into sick and healthy groups
(Fig. 1) [25-33]. Decision trees are helpful in such
settings because they provide suitable classification
results. A decision tree is an algorithm that takes
an input object and returns an output based on a
series of decisions. Each internal node tests one
attribute of the input. The branches leaving a node
correspond to the possible values of that attribute.
The leaf nodes hold the output values. A decision
tree is thus a decision-support tool that uses a
tree-like model. Classification rules are represented
by the paths from root to leaf. There are three types
of nodes: decision nodes, usually denoted by squares;
chance nodes, denoted by circles; and end nodes,
denoted by triangles [34-38].


A decision tree is basically a predictive modeling
approach. It is constructed by finding different ways
of splitting a data set based on various conditions.
Decision trees in which the target variable can take
continuous values (typically real numbers) are called
regression trees; classification and regression tree
(CART) is the umbrella term that covers both
classification trees and regression trees. Decision
trees are among the most widely used methods in
machine learning. They are a non-parametric
supervised learning method used for classification
and regression. A decision tree gives information
about gene interactions through stepwise splitting of
the data set: each split exposes one attribute, and
the hierarchical structure shows the nature of the
interactions [39-44].

The building blocks of a decision tree are as follows.
A decision tree consists only of burst nodes
(splitting paths) and has no sink nodes (converging
paths). Consequently, it can grow very large, which
is why such trees are often difficult to draw by
hand. A decision tree can be linearized into decision
rules, where the outcome is the content of the leaf
node and the conditions along the path form a
conjunction in the if clause. In general, the rules
have the form: if condition 1 and condition 2 and
condition 3, then outcome. Decision rules can be
generated by constructing association rules with the
target variable on the right-hand side. Causal or
temporal relations can also be
represented.


Fig. 1: A simple decision tree. Disease can be predicted from the genotype and
environment, e.g., an individual in environment-4 with genotype aabb will be
diseased.


A decision tree poses
different types of queries at each node. The
information gain corresponding to each query is
calculated and used as the criterion for choosing the
feature to split on at each step, so the tree grows
by one split at a time. Information gain can be
described as a measure of how much a split increases
purity. The amount of information gained for a class
is measured using the information value. The split
with the highest information gain is selected first,
and the process continues until a level is reached
where the child nodes are pure or the information
value falls to 0 [45-49].
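
The information-gain criterion can be written down compactly. The formulas below are a sketch under one assumption: Shannon entropy is taken as the "information value", which is a common choice but is not named in the text. Here S is the set of samples at a node, p_c is the proportion of class c in S, and A is an attribute whose values v divide S into subsets S_v.

```latex
H(S) = -\sum_{c} p_c \log_2 p_c ,
\qquad
IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)
```

For example, a node holding four diseased and four healthy individuals has H(S) = 1 bit. A split that separates them perfectly gives two pure children with H(S_v) = 0, so IG = 1 bit and no further splitting is needed.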

A pure sample is one in which every item drawn from
the data set belongs to the same class, whereas an
impure sample is a blend of several classes. Gini
impurity is a parameter for measuring this. It
measures the probability that a randomly chosen case
would be classified incorrectly if it were labelled
at random according to the class distribution of the
sample. If the dataset is pure, the chance of
incorrect classification is zero; the likelihood of
incorrect classification is higher if the sample is a
blend of various classes. The basic steps of building
a decision tree are as follows. First, the impurity
of the whole dataset is calculated. A list is
produced containing all the questions that can be
asked at the nodes. For each question, the rows are
divided into those for which the answer is true and
those for which it is false. The Gini impurity of the
resulting partitions and the information gained from
the division are calculated. The maximum information
gain found so far is updated, together with the
question that produced it, and the node is
partitioned on the basis of the best question. These
steps are repeated from the beginning in a loop until
pure nodes are obtained. A decision tree is simple to
understand, even after only a brief explanation of
the model. It also helps in determining the best,
worst and expected values for different scenarios. A
drawback of decision trees is that they are prone to
overfitting, so some form of validation is needed to
keep the model in check [50-53].
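
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the records in `ROWS` are invented, questions are restricted to equality tests on categorical attributes, and the function names (`gini`, `partition`, `info_gain`, `best_split`, `build_tree`, `classify`) are chosen here for readability only.

```python
from collections import Counter

# Hypothetical toy records: (genotype at locus A, genotype at locus B,
# environment, class label). The values are invented for illustration.
ROWS = [
    ("AA", "BB", "env1", "healthy"),
    ("Aa", "bb", "env4", "healthy"),
    ("aa", "BB", "env4", "healthy"),
    ("aa", "bb", "env4", "diseased"),
    ("aa", "bb", "env1", "healthy"),
    ("aa", "bb", "env4", "diseased"),
]


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def partition(rows, col, value):
    """Divide rows by the question 'is row[col] equal to value?'."""
    true_rows = [r for r in rows if r[col] == value]
    false_rows = [r for r in rows if r[col] != value]
    return true_rows, false_rows


def info_gain(rows, true_rows, false_rows):
    """Parent impurity minus the size-weighted impurity of the two children."""
    p = len(true_rows) / len(rows)
    return (gini([r[-1] for r in rows])
            - p * gini([r[-1] for r in true_rows])
            - (1 - p) * gini([r[-1] for r in false_rows]))


def best_split(rows):
    """Try every (column, value) question and keep the one with the highest gain."""
    best_gain, best_question = 0.0, None
    for col in range(len(rows[0]) - 1):
        for value in sorted({r[col] for r in rows}):
            true_rows, false_rows = partition(rows, col, value)
            if not true_rows or not false_rows:
                continue  # the question does not divide the rows
            gain = info_gain(rows, true_rows, false_rows)
            if gain > best_gain:
                best_gain, best_question = gain, (col, value)
    return best_gain, best_question


def build_tree(rows):
    """Split recursively until no question yields a positive gain."""
    _, question = best_split(rows)
    if question is None:
        return Counter(r[-1] for r in rows)  # leaf: class counts
    true_rows, false_rows = partition(rows, *question)
    return question, build_tree(true_rows), build_tree(false_rows)


def classify(tree, row):
    """Follow the questions from the root to a leaf; return its majority class."""
    if isinstance(tree, Counter):
        return tree.most_common(1)[0][0]
    (col, value), true_branch, false_branch = tree
    branch = true_branch if row[col] == value else false_branch
    return classify(branch, row)


tree = build_tree(ROWS)
print(classify(tree, ("aa", "bb", "env4")))  # prints "diseased"
```

Because the sketch keeps splitting until every leaf is pure, it reproduces the training labels exactly, which is the overfitting tendency noted above. A practical implementation would limit the depth or prune the tree.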


K-nearest neighbor (KNN)


The k-nearest neighbor algorithm is a simple and
easy-to-implement supervised machine learning
algorithm. It can be used to solve both
classification and regression problems. The algorithm
characterizes an entity from what is known about its
neighbors. KNN is a form of instance-based learning
and is also referred to as example-based reasoning or
lazy learning. In regression, KNN predicts the
average of the values of the nearest neighbors. In
classification, it checks the similarity of a new
data point to the stored points and places it in the
group containing its closest neighbors. It is a
powerful classification algorithm that is used in
pattern recognition. KNN is non-parametric, makes no
assumptions about the underlying data and is a lazy
learner: training consists only of storing the
labelled examples. It does not build a model of its
own; it works by retrieving the stored information
and using it to solve the classification or
regression problem for a new data point [54]. To
classify a new data point, it uses a voting system
over the classes of its neighbors. Closeness is
measured by a distance function. The new object is
assigned to the group containing its closest
neighbors (Fig. 2).


[Flowchart: START -> compute the distance between the input sample and the
training sample -> take the nearest neighbours -> apply simple majority -> END]


Fig. 2: KNN classifier algorithmic steps
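
The flow in Fig. 2 can be sketched as follows. This is a minimal illustration that assumes Euclidean distance as the distance function and k = 3; the expression values in `TRAIN` and `TRAIN_REG` are invented, and the function names are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(train, query, k):
    """Return the k stored (features, target) pairs closest to the query."""
    return sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]


def knn_classify(train, query, k=3):
    """Classification: simple majority vote among the k nearest neighbours."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Regression: average of the target values of the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(target for _, target in neighbours) / len(neighbours)


# Hypothetical expression levels of two genes with a health label.
TRAIN = [
    ((1.0, 1.2), "healthy"), ((0.8, 1.0), "healthy"), ((1.1, 0.9), "healthy"),
    ((3.0, 3.2), "sick"), ((3.1, 2.9), "sick"), ((2.9, 3.0), "sick"),
]
# Hypothetical one-feature regression data.
TRAIN_REG = [((1.0,), 2.0), ((2.0,), 4.0), ((3.0,), 6.0), ((10.0,), 20.0)]

print(knn_classify(TRAIN, (1.0, 1.0)))    # prints "healthy"
print(knn_regress(TRAIN_REG, (2.0,)))     # prints 4.0
```

Note that there is no training step: all the work happens at query time, which is what makes KNN a lazy learner.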


Genetic Algorithm 


The genetic algorithm is a heuristic search method
used in artificial intelligence. It is used to find
better solutions to search problems based on the
theory of natural selection and evolutionary biology.
Genetic algorithms are mostly recommended for
searching through large and intricate data sets. The
genetic algorithm relies on the concept of evolution.
It generates samples and works on the principle of
survival of the fittest [56]. It consistently tries
to improve the population by selecting the fit
candidates and rejecting the weak ones. The selected
candidates increase in number from generation to
generation, while the unselected ones decrease.
Nature has adopted this method of experimentation for
the sake of improvement. Natural selection relies on
various biological operators, including mutation and
crossing over, which are also used in the genetic
algorithm. The genetic algorithm mimics the natural
process of selecting the fit and generating
offspring. Like nature, it also takes robustness into
account. It shuffles the genes and moves from simple
to complex solutions for the sake of optimization.
The genetic algorithm consists of a sequence of
events used to find a suitable solution. The first
event is initialization, which creates an initial
population from which the fittest can later be
selected. The purpose of selecting a parent
population is to generate an offspring population.
The population contains candidates that provide a
variety of genetic material. Next, the parent
population is evaluated by a fitness function, which
plays a role similar to a heuristic. The fittest
candidates are then selected to generate the
offspring population on the basis of the values
assigned by the fitness function, mimicking the
natural phenomenon of survival of the fittest.
Crossover is applied by swapping the tails of the
parents; this bio-operator swaps bits between them.
The genetic representation here is an array of bits.
This sequence of events works like a loop: it is
repeated until the generations reach a constant
value. Once constant values are obtained, the
algorithm terminates [57].
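
The sequence of events can be sketched as a short Python loop over bit arrays. Several details are assumptions made for illustration and are not taken from the text: the fitness function simply counts 1-bits (a toy problem), selection is a two-way tournament, mutation flips each bit with a small probability, the best candidate is carried over unchanged (elitism), and "constant value" is interpreted as no improvement for `patience` generations.

```python
import random


def fitness(bits):
    """Hypothetical fitness function: the number of 1-bits in the candidate."""
    return sum(bits)


def select(population, rng):
    """Tournament selection: the fitter of two randomly drawn candidates."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def crossover(parent1, parent2, rng):
    """One-point crossover: swap the tails of the two parents."""
    point = rng.randrange(1, len(parent1))
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])


def mutate(bits, rng, rate=0.01):
    """Flip each bit independently with probability `rate`."""
    return [1 - b if rng.random() < rate else b for b in bits]


def genetic_algorithm(n_bits=20, pop_size=30, patience=15,
                      max_generations=500, seed=0):
    rng = random.Random(seed)
    # Initialization: a random parent population of bit arrays.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    stale = 0
    for _ in range(max_generations):
        offspring = [best[:]]  # elitism: keep the current best candidate
        while len(offspring) < pop_size:
            child1, child2 = crossover(select(population, rng),
                                       select(population, rng), rng)
            offspring.append(mutate(child1, rng))
            offspring.append(mutate(child2, rng))
        population = offspring[:pop_size]
        champion = max(population, key=fitness)
        if fitness(champion) > fitness(best):
            best, stale = champion, 0
        else:
            stale += 1
        history.append(fitness(best))
        if stale >= patience:  # fitness has stayed constant: terminate
            break
    return best, history


best, history = genetic_algorithm()
print(fitness(best), len(history))
```

The `history` list records the best fitness after each generation, which makes the plateau that triggers termination easy to inspect.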


Artificial Neural Networks (ANNs) 


Artificial neural networks (ANNs) are a key tool of
machine learning. An ANN is a representation of human
brain function in the form of a computational model.
It is a machine learning algorithm for pattern
recognition and is composed of an input layer, a
hidden layer and an output layer. The units of
information in each layer are referred to as nodes.
Nodes of different layers are interconnected to form
a network that resembles the natural biological
system [51-54]. The connections between nodes carry
mathematical weights that can be trained with known
patterns and used later for predictions. Once the
network has been trained, it is capable of
identifying the relationship between the input and
the output. Learning in humans involves minute
adjustments to the synaptic connections among
neurons. By comparison, the learning phase of an ANN
is based on the interconnections among the processing
elements that make up the network topology. ANNs have
become one of the essential tools in bioinformatics.
This was driven by the expansion and rapid growth of
many biological databases, which store data on RNA
and DNA sequences, proteins and other macromolecular
structures. Such massive amounts of data require
computational tools that can deal with their
complexity and retrieve the information of interest
[58-61]. The diverse uses of neural networks include
predictive modelling, classification and the
identification of biomarkers within highly complex
datasets. The three primary forms of ANNs are radial
basis function networks, the multilayer perceptron
and recurrent neural networks [62-64].


Multilayer Perceptron (MLP) 


The multilayer perceptron (MLP) is a feedforward
artificial neural network in which the neurons, or
processing elements, are arranged into multiple
layers (Fig. 3). MLPs share a common topology
consisting of an input layer, one or more hidden
layers and an output layer. The complexity of the
problem, reflected in the number of input neurons,
determines the number of hidden layers. The input
layer interfaces with the external environment and
receives the data as a vector of predictor variables,
each represented by a node. Each neuron produces an
output by passing the sum of the products of its
inputs and weights through a non-linear transfer
function. MLPs are used to predict the secondary
structure of proteins, the relative solvent
accessibility of protein residues, binding residues,
transmembrane regions and quantitative traits from
genotype data [65-68].
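
The forward pass of such a network can be sketched in plain Python. The sketch rests on assumptions that are not stated in the text: the transfer function is a sigmoid, each neuron has a bias term, the weights are random rather than trained, and the layer sizes (4 inputs, 3 hidden neurons, 2 outputs) are arbitrary.

```python
import math
import random


def sigmoid(z):
    """Non-linear transfer function (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Each neuron sums the products of inputs and weights, then applies sigmoid."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        net = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(sigmoid(net))
    return outputs


def mlp_forward(x, layers):
    """Feed the input vector through every layer in turn (feedforward)."""
    activation = list(x)
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation


def random_layer(n_in, n_out, rng):
    """Untrained layer: random weights in [-1, 1] and zero biases."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
network = [random_layer(4, 3, rng), random_layer(3, 2, rng)]
output = mlp_forward([0.2, 0.7, 0.1, 0.9], network)
print(output)
```

Training, for example by back-propagation, would adjust the weights and biases so that `output` matches known patterns; that step is omitted here.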


Recurrent Neural Network (RNN) 


Recurrent neural networks are designed for
sequential or time-series data. A recurrent neural
network is obtained by modifying the MLP structure. A
context layer is added, which serves to retain
information across observations. At every iteration,
a new feature vector is fed into the input layer. The
previous contents of the hidden layer are copied to
the context layer and fed back into the hidden layer
in the next iteration. RNN processing involves the
following steps. The input values are presented to
the input nodes. The net inputs are calculated from
the input nodes and from the nodes in the context
layer. The hidden-node activations are computed from
the net inputs and are used to calculate the
output-node activations. The back-propagation
algorithm is used to compute the new weight values,
and the new hidden-node values are stored in the
context layer. The weights between the input layer
and the hidden layer are obtained by the same method
as in the MLP. The weights between the context layer
and the hidden layer play a vital role in finding the
error values, because the error values depend on the
hidden-node values received at the t-th iteration
(Fig. 4).

RNN architectures can be successfully applied to the
prediction of β-turns, protein secondary structure,
the number of residue contacts, continuous B-cell
epitopes, binding sites of transcription factors and
sequential phenotypes in genomics [69-71].
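
The role of the context layer can be illustrated with a forward-only sketch of the recurrent step described above. The assumptions are labelled in the code: tanh is used as the hidden activation, the output layer is linear, the context starts at zero, and the weights are fixed by hand. The back-propagation weight update is deliberately left out.

```python
import math


def elman_step(x, context, w_in, w_ctx, w_out):
    """One iteration: returns the outputs and the new hidden activations."""
    hidden = []
    for j in range(len(w_in)):
        # Net input = contribution of input nodes + contribution of context nodes.
        net = (sum(w * xi for w, xi in zip(w_in[j], x))
               + sum(w * c for w, c in zip(w_ctx[j], context)))
        hidden.append(math.tanh(net))  # assumed activation function
    # Linear output layer (an assumption made to keep the sketch short).
    output = [sum(w * h for w, h in zip(row, hidden)) for row in w_out]
    return output, hidden


def run_sequence(sequence, w_in, w_ctx, w_out):
    """Process a sequence, copying the hidden layer into the context each step."""
    context = [0.0] * len(w_in)  # assumed initial context
    outputs = []
    for x in sequence:
        y, context = elman_step(x, context, w_in, w_ctx, w_out)
        outputs.append(y)
    return outputs


# Hand-picked weights: one input node, one hidden node, one output node.
outputs = run_sequence([[1.0], [1.0]],
                       w_in=[[1.0]], w_ctx=[[0.5]], w_out=[[1.0]])
print(outputs)
```

Although the same input is presented twice, the two outputs differ because the second step also receives the hidden activation stored in the context layer.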


Radial Basis Function Neural Networks 


Radial basis function (RBF) neural networks
comprise three layers: an input layer, a hidden
layer (a feature vector with a non-linear RBF
activation function) and a linear output layer.
Before classification is performed, a non-linear
transfer function is applied in the hidden layer.
The linear separability of the hidden-layer
representation increases as the dimension of the
hidden layer increases (Fig. 5).

In the hidden layer, the input vectors are mapped
onto each RBF, which is usually implemented as a
Gaussian function. The centre and spread of each RBF
are set using the training dataset; the methods used
include k-means clustering or taking a random subset
of the training vectors. In regression problems, the
output layer is a linear combination of the values
generated by the hidden layer, which corresponds to
the mean predicted output. In classification
problems, the output is conventionally obtained by
applying a sigmoid function to a linear combination
of the hidden-layer values, representing a posterior
probability.

The RBF neural networks can be successfully
applied in the prediction of inter-residue contact
maps, cleavage sites of proteases and targets for
protein-targeting compounds in drug designing [72-76].
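
The computation performed by an RBF network can be sketched as follows. The sketch assumes Gaussian basis functions with a single shared spread and takes the centres as a random subset of the training vectors, one of the two options mentioned above. The training vectors and output weights are invented; fitting the output weights is omitted.

```python
import math
import random


def gaussian_rbf(x, centre, spread):
    """Gaussian radial basis function of the distance between x and a centre."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-dist_sq / (2.0 * spread ** 2))


def hidden_layer(x, centres, spread):
    """Map the input vector onto every RBF in the hidden layer."""
    return [gaussian_rbf(x, c, spread) for c in centres]


def rbf_regression(x, centres, spread, weights, bias=0.0):
    """Regression output: a linear combination of the hidden-layer values."""
    hidden = hidden_layer(x, centres, spread)
    return bias + sum(w * h for w, h in zip(weights, hidden))


def rbf_probability(x, centres, spread, weights, bias=0.0):
    """Classification output: sigmoid of the linear combination."""
    z = rbf_regression(x, centres, spread, weights, bias)
    return 1.0 / (1.0 + math.exp(-z))


def pick_centres(training_vectors, n_centres, seed=0):
    """Choose the centres as a random subset of the training vectors."""
    return random.Random(seed).sample(training_vectors, n_centres)


# Hypothetical training vectors and hand-picked output weights.
training = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centres = pick_centres(training, 2)
print(rbf_probability((0.0, 0.1), centres, spread=1.0, weights=[1.5, -1.5]))
```

Replacing `pick_centres` with a k-means step would give the other centre-selection method mentioned in the text.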
Fig. 3: Architecture of a typical multi-layered perceptron artificial neural network.

Fig. 4: Architecture of RNN. Like deep neural networks, RNN also contains an input layer, a hidden layer, and an
output layer. An additional context layer is connected to the hidden layer.

Fig. 5: Architecture of RBF neural network.


Conclusion 


Artificial intelligence and machine learning are 
strengthening the field of computational biology and 
bioinformatics. As the volume of biological data is 
increasing, AI and machine learning promise great 
support in managing and performing analysis. The 
efficient prediction methods provide a great deal of 
support for researchers in the early diagnosis of 
diseases. 